Journalists and lobbyists tell dramatic stories. That’s their job. They tell stories about extraordinary events and unusual people. The piles of dramatic stories pile up in people’s minds into an over-dramatic worldview and strong negative stress feelings: “The world is getting worse!”, “It’s we vs. them!”, “Other people are strange!”, “The population just keeps growing!” and “Nobody cares!”
library(tidyverse)
library(dslabs)
library(ggrepel)
data(gapminder)
gapminder |> as_tibble()
## # A tibble: 10,545 × 9
## country year infant_mortality life_expectancy fertility population gdp
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albania 1960 115. 62.9 6.19 1636054 NA
## 2 Algeria 1960 148. 47.5 7.65 11124892 1.38e10
## 3 Angola 1960 208 36.0 7.32 5270844 NA
## 4 Antigua… 1960 NA 63.0 4.43 54681 NA
## 5 Argenti… 1960 59.9 65.4 3.11 20619075 1.08e11
## 6 Armenia 1960 NA 66.9 4.55 1867396 NA
## 7 Aruba 1960 NA 65.7 4.82 54208 NA
## 8 Austral… 1960 20.3 70.9 3.45 10292328 9.67e10
## 9 Austria 1960 37.3 68.8 2.7 7065525 5.24e10
## 10 Azerbai… 1960 NA 61.3 5.57 3897889 NA
## # ℹ 10,535 more rows
## # ℹ 2 more variables: continent <fct>, region <fct>
filter(gapminder, year == 1962) |>
ggplot(aes(fertility, life_expectancy)) +
geom_point()
Show continent with color
filter(gapminder, year == 1962) |>
ggplot( aes(fertility, life_expectancy, color = continent)) +
geom_point()
Show a scatterplot for two years and each continent separately.
filter(gapminder, year%in%c(1962, 2012)) |>
ggplot(aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_grid(year~continent)
Combine the continents into one panel per year and use color to denote them:
filter(gapminder, year%in%c(1962, 2012)) |>
ggplot(aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_grid(. ~ year)
facet_wrap
Let’s show the same plot for five years, restricted to Europe and Asia:
years <- c(1962, 1980, 1990, 2000, 2012)
continents <- c("Europe", "Asia")
gapminder |>
filter(year %in% years & continent %in% continents) |>
ggplot( aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_wrap(~year)
By default, ggplot uses the same x-axis and y-axis ranges in every panel.
Here is how to change that:
filter(gapminder, year%in%c(1962, 2012)) |>
ggplot(aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_wrap(. ~ year, scales = "free")
But free scales make comparisons across panels difficult.
Time series plot:
gapminder |>
filter(country == "United States") |>
ggplot(aes(year, fertility)) +
geom_point()
Here it is with lines:
gapminder |>
filter(country == "United States") |>
ggplot(aes(year, fertility)) +
geom_line()
Using lines is particularly helpful with comparisons.
Let’s try it:
countries <- c("South Korea","Germany")
gapminder |> filter(country %in% countries) |>
ggplot(aes(year,fertility)) +
geom_line()
What happened? ggplot connected the points from both countries as a single line.
We need to let ggplot know there are two lines, one per country:
countries <- c("South Korea","Germany")
gapminder |> filter(country %in% countries & !is.na(fertility)) |>
ggplot(aes(year, fertility, group = country)) +
geom_line()
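A minimal sketch of why the grouping matters, using a small made-up data frame (the `toy` data and its values are hypothetical, not taken from gapminder). `layer_data()` returns the data ggplot2 actually draws, including the group assigned to each row: with no discrete aesthetic, every row lands in the same group, so `geom_line()` joins all points into one path.

```r
library(ggplot2)

# Hypothetical toy data: two countries, three years each
toy <- data.frame(
  year = rep(c(2000, 2005, 2010), times = 2),
  fertility = c(1.4, 1.3, 1.4, 1.5, 1.1, 1.2),
  country = rep(c("A", "B"), each = 3)
)

p_single  <- ggplot(toy, aes(year, fertility)) + geom_line()
p_grouped <- ggplot(toy, aes(year, fertility, group = country)) + geom_line()

# Count the distinct groups ggplot2 uses when drawing each plot
n_single  <- length(unique(layer_data(p_single)$group))
n_grouped <- length(unique(layer_data(p_grouped)$group))
c(single = n_single, grouped = n_grouped)
```

Mapping `col = country` has the same grouping effect, because any discrete aesthetic splits the data into groups.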
Use color to distinguish them:
countries <- c("South Korea","Germany")
gapminder |> filter(country %in% countries & !is.na(fertility)) |>
ggplot(aes(year,fertility, col = country)) +
geom_line()
For trend plots we recommend labeling the lines rather than using legends.
The geomtextpath package lets us place the labels directly on the lines:
library(geomtextpath)
gapminder |>
filter(country %in% countries) |>
ggplot(aes(year, life_expectancy, col = country, label = country)) +
geom_textpath() +
theme(legend.position = "none")
Convert GDP to dollars per day:
gapminder <- gapminder |> mutate(dollars_per_day = gdp/population/365)
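As a quick check of the arithmetic, here is the same formula applied by hand to the Algeria 1960 row printed in the table earlier. The GDP figure is the rounded value as displayed there, so the result is only approximate.

```r
# Values copied from the printed tibble (Algeria, 1960)
gdp <- 1.38e10          # rounded, as displayed in the output
population <- 11124892

# Annual GDP per person, spread over 365 days
dollars_per_day <- gdp / population / 365
round(dollars_per_day, 1)  # about 3.4 dollars per person per day
```

Note that this is an average over the whole population; it says nothing about how income is distributed within the country.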
Here is a histogram of per day incomes from 1970:
past_year <- 1970
gapminder |>
filter(year == past_year & !is.na(gdp)) |>
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black")
We use the color = "black" argument to draw a boundary
and clearly distinguish the bins.
Here is the distribution if we apply a log base 2 transform:
gapminder |>
filter(year == past_year & !is.na(gdp)) |>
ggplot(aes(log2(dollars_per_day))) +
geom_histogram(binwidth = 1, color = "black")
In a way this provides a close-up of the mid to lower income countries.
Log base 10 is useful for changes in order of magnitude.
Log base 2 is useful in most other circumstances.
The natural log is rarely if ever useful for dataviz.
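A short illustration of how the two bases read, using made-up values: one unit on the log2 scale is a doubling, and one unit on the log10 scale is a tenfold change.

```r
# Each doubling adds 1 on the log2 scale
doublings <- log2(c(1, 2, 4, 8))
doublings   # 0 1 2 3

# Each tenfold increase adds 1 on the log10 scale
magnitudes <- log10(c(10, 100, 1000))
magnitudes  # 1 2 3
```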
For population, log base 10 is probably best:
filter(gapminder, year == past_year) |>
summarize(min = min(population), max = max(population))
## min max
## 1 46075 808510713
Here is the histogram of the transformed values:
gapminder |>
filter(year == past_year) |>
ggplot(aes(log10(population))) +
geom_histogram(binwidth = 0.5, color = "black")
In the above, we quickly see that country populations range between ten thousand and ten billion.
Transforming the values before plotting makes intermediate values easier to interpret. For example, if we see
----1----x----2--------3----
for log transformed data, we know that the value of x is
about 1.5. If instead the scales are logged, the axis shows the original values:
----10---x---100------1000---
This saves you the conversion in your head, but figuring out
x is harder.
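To make the trade-off concrete, here is a small sketch. A point halfway between the ticks 10 and 100 on a log10 axis sits at 1.5 on the log scale, and recovering its original value takes an exponentiation. The variable names are purely illustrative.

```r
# Position of x read off a log10-transformed axis
log_position <- 1.5

# Converting back to the original scale
x_original <- 10^log_position
round(x_original, 1)  # 31.6

# A naive linear reading halfway between 10 and 100 would give
linear_midpoint <- (10 + 100) / 2
linear_midpoint       # 55
```

The gap between the two numbers is why reading intermediate values off a logged axis takes more care.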
gapminder |>
filter(year == past_year & !is.na(gdp)) |>
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2")
gapminder |>
filter(year == past_year & !is.na(gdp)) |>
mutate(region = reorder(region, dollars_per_day, FUN = median)) |>
ggplot(aes(dollars_per_day, region)) +
geom_point() +
scale_x_continuous(trans = "log2")
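The `reorder()` call in the plot above sorts the regions by median income so the rows appear in a meaningful order. A minimal sketch of that behavior on hypothetical values (the letters and numbers below are invented for illustration):

```r
# Hypothetical regions with two observations each
region <- c("a", "a", "b", "b", "c", "c")
value  <- c(5, 7, 1, 3, 10, 20)

# Default factor levels are alphabetical
levels(factor(region))  # "a" "b" "c"

# reorder() sorts the levels by the per-region median of `value`
# (medians: a = 6, b = 2, c = 15)
region_sorted <- reorder(region, value, FUN = median)
levels(region_sorted)   # "b" "a" "c"
```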
gapminder <- gapminder |>
  mutate(group = case_when(
    region %in% c("Western Europe", "Northern Europe", "Southern Europe",
                  "Northern America",
                  "Australia and New Zealand") ~ "West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America",
                  "South America") ~ "Latin America",
    continent == "Africa" &
      region != "Northern Africa" ~ "Sub-Saharan",
    TRUE ~ "Others"))
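A minimal sketch of how `case_when()` resolves the conditions above: they are checked in order, the first match wins, and `TRUE ~ "Others"` catches whatever is left. The four-row `toy` tibble is hypothetical and uses a shortened set of rules.

```r
library(dplyr)

# Hypothetical rows covering each branch
toy <- tibble(
  region = c("Western Europe", "Eastern Asia",
             "Northern Africa", "Eastern Africa"),
  continent = c("Europe", "Asia", "Africa", "Africa")
)

toy <- toy |>
  mutate(group = case_when(
    region %in% c("Western Europe") ~ "West",
    region %in% c("Eastern Asia") ~ "East Asia",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan",
    TRUE ~ "Others"  # fallback for rows matching none of the above
  ))

toy$group  # "West" "East Asia" "Others" "Sub-Saharan"
```

Northern Africa falls through to "Others" because it fails the `region != "Northern Africa"` test and matches no earlier rule.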
We turn this group variable into a factor to control the
order of the levels:
gapminder <- gapminder |>
  mutate(group = factor(group, levels = c("Others", "Latin America",
                                          "East Asia", "Sub-Saharan",
                                          "West")))
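A small sketch of what setting `levels` changes, on a made-up character vector: by default the levels are sorted alphabetically, while an explicit `levels` argument fixes the order that plots will use.

```r
# Hypothetical group labels
group <- c("West", "Others", "East Asia", "West")

# Default: levels sorted alphabetically
default_levels <- levels(factor(group))
default_levels  # "East Asia" "Others" "West"

# Explicit order, as in the code above
custom <- factor(group, levels = c("Others", "East Asia", "West"))
levels(custom)      # "Others" "East Asia" "West"
as.integer(custom)  # 3 1 2 3
```

ggplot2 lays out discrete axes and facets in level order, which is why the conversion matters for the plots that follow.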
In the next section we demonstrate how to visualize and compare distributions across groups.
p <- gapminder |>
filter(year == past_year & !is.na(gdp)) |>
ggplot(aes(group, dollars_per_day)) +
geom_boxplot() +
scale_y_continuous(trans = "log2") +
xlab("") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
Boxplots have the limitation that by summarizing the data into five numbers, we might miss important characteristics of the data. One way to avoid this is by showing the data.
p + geom_point(alpha = 0.5)
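To see how much a five-number summary can hide, here is a sketch with two invented samples. `fivenum()` returns Tukey's five numbers (minimum, lower hinge, median, upper hinge, maximum), which are close to what a boxplot draws.

```r
# Evenly spread values
spread_out <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Values clustered around 3 and around 7 (two modes)
two_modes <- c(1, 3, 3, 3, 5, 7, 7, 7, 9)

fivenum(spread_out)  # 1 3 5 7 9
fivenum(two_modes)   # 1 3 5 7 9

# Same summary, different shapes
same_summary <- isTRUE(all.equal(fivenum(spread_out), fivenum(two_modes)))
same_summary         # TRUE
```

Boxplots of these two samples would look essentially the same, even though one has two clear clusters.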
Boxplots are very efficient in summarizing distributions but can hide bimodalities.
If there are not too many categories, we can show density plots instead of boxes:
library(ggridges)
p <- gapminder |>
filter(year == past_year & !is.na(dollars_per_day)) |>
ggplot(aes(dollars_per_day, group)) +
scale_x_continuous(trans = "log2")
p + geom_density_ridges()
We can add points to show data:
p + geom_density_ridges(jittered_points = TRUE)
or tick marks:
p + geom_density_ridges(jittered_points = TRUE,
position = position_points_jitter(height = 0),
point_shape = '|', point_size = 3,
point_alpha = 1, alpha = 0.7)
past_year <- 1970
present_year <- 2010
years <- c(past_year, present_year)
gapminder |>
filter(year %in% years & !is.na(gdp)) |>
mutate(west = ifelse(group == "West", "West", "Developing")) |>
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ west)
The relationship between these two variables is almost perfectly linear and the graph shows a dramatic difference. While in the West less than 0.5% of infants die, in Sub-Saharan Africa the rate is higher than 6%!
Note that the plot uses a new transformation, the logistic transformation.
\[f(p) = \log \left( \frac{p}{1-p} \right)\]
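A minimal sketch of the transformation in R, applied to the two rates mentioned above (0.5% and 6%). The helper name `logit` is our own; base R provides the same function as `qlogis()`.

```r
# Logistic (logit) transformation: log of the odds p / (1 - p)
logit <- function(p) log(p / (1 - p))

logit(0.5)  # 0, since the odds are 1

# Infant mortality rates quoted in the text, as proportions
p <- c(west = 0.005, sub_saharan = 0.06)
odds <- p / (1 - p)
logit_p <- logit(p)

# Agrees with the built-in quantile function of the logistic distribution
matches_builtin <- isTRUE(all.equal(logit_p, qlogis(p)))
matches_builtin  # TRUE
```

For small proportions the odds are close to the proportion itself, so on this scale differences between very low rates are stretched out rather than squeezed near zero.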
library(tidyverse)
library(RColorBrewer)
library(dslabs)
data(us_contagious_diseases)
names(us_contagious_diseases)
## [1] "disease" "state" "year" "weeks_reporting"
## [5] "count" "population"
the_disease <- "Measles"
dat <- us_contagious_diseases |>
filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) |>
mutate(rate = count / population * 100000 * 52 / weeks_reporting) |>
mutate(state = reorder(state, rate))
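The rate formula above scales the count to cases per 100,000 people and then adjusts for years in which fewer than 52 weeks were reported. A worked example with invented numbers (not taken from `us_contagious_diseases`):

```r
# Hypothetical state-year: 500 cases reported over 26 weeks
count <- 500
population <- 1e6
weeks_reporting <- 26

# Cases per 100,000 people, extrapolated to a full 52-week year
rate <- count / population * 100000 * 52 / weeks_reporting
rate  # about 100 cases per 100,000 per year
```

Note that a row with `weeks_reporting` equal to 0 would involve a division by zero, which is one reason the California plot filters on `!is.na(rate)`.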
We can now easily plot disease rates per year. Here are the measles data from California:
dat |> filter(state == "California" & !is.na(rate)) |>
ggplot(aes(year, rate)) +
geom_line() +
ylab("Cases per 100,000") +
geom_vline(xintercept=1963, col = "blue")
dat |> ggplot(aes(year, state, fill = rate)) +
geom_tile(color = "grey50") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
geom_vline(xintercept=1963, col = "blue") +
theme_minimal() +
theme(panel.grid = element_blank(),
legend.position="bottom",
text = element_text(size = 8)) +
ggtitle(the_disease) +
ylab("") + xlab("")